Abstract

The convolution between a text string S of length N and a pattern string P of
length m can be computed in O(N log m) time by FFT. It is known that various
types of approximate string matching problems are reducible to convolution. In
this paper, we assume that the input text string is given in a compressed form, as
a straight-line program (SLP), which is a context free grammar in the Chomsky
normal form that derives a single string. Given an SLP 𝒮 of size n describing a
text S of length N, and an uncompressed pattern P of length m, we present a
simple O(nm log m)-time algorithm to compute the convolution between S and
P. We then show that this can be improved to O(min{nm, N − α} log m) time,
where α ≥ 0 is a value that represents the amount of redundancy that the SLP
captures with respect to the length-m substrings. The key of the improvement is
our new algorithm that computes the convolution between a trie of size r and a
pattern string P of length m in O(r log m) time.

1 Introduction 

String matching is the task of finding all occurrences of a pattern of length m in a
text of length N. In various fields of computer science such as bioinformatics, image
analysis and data compression, detecting approximate occurrences of a pattern is of
great importance. Fischer and Paterson [8] found that various approximate string
matching problems can be solved efficiently by reduction to convolution, and many
studies have followed since. For instance, it was shown in [8] that the string matching
problem with don't cares can be solved in O(N log m log σ) time, where σ is the
alphabet size. This was later improved to O(N log m) time [6][5]. An
O(N √(m log m))-time algorithm for computing the Hamming distances between the
pattern and all text substrings of length m was proposed in [1].

Many, if not all, large string data sets are stored in a compressed form, and are
later decompressed in order to be used and/or analyzed. Compressed string processing
(CSP) arose from the recent rapid increase of digital data, as an approach to process
a given compressed string without explicitly decompressing the entire string. Many
CSP algorithms have been proposed in the last two decades, which improve on
algorithms working on uncompressed strings, both in theory [15][7][14][10] and in
practice [19][12][13].

The goal of this paper is efficient computation of the convolution between a
compressed text and an uncompressed pattern. In this paper, we assume that the input
string is represented by a straight-line program (SLP), which is a context free grammar
in the Chomsky normal form that derives a single string. It is well known that outputs
of various grammar based compression algorithms [17][16], as well as those of
dictionary compression algorithms [23][21][22][20], can be regarded as, or be quickly
transformed to, SLPs [18]. Hence, algorithmic research working on SLPs is of great
significance.
We present two efficient algorithms that compute the convolution between an
SLP-compressed text of size n and an uncompressed pattern of length m. The first one
runs in O(nm log m) time and space, and is based on partial decompression of the
SLP-compressed text. Whenever nm = o(N), this is more efficient than the existing
FFT-based O(N log m)-time algorithm for computing the convolution of a string of
length N and a pattern of length m. However, in the worst case n can be as large as
O(N). Our second algorithm deals with such a case. The key is a reduction of the
convolution of an SLP and a pattern, to the convolution of a trie and a pattern. We show
how, given a trie of size r and a pattern of length m, we can compute the convolution
between all strings of length m in the trie and the pattern in O(r log m) time. This
result gives us an O(min{nm, N − α} log m)-time algorithm for computing the
convolution between an SLP-compressed text and a pattern, where α ≥ 0 represents a
quantity of redundancy of the SLP w.r.t. the substrings of length m. Notice that our
second method is at least as efficient as the existing O(N log m) algorithm, and can be
much more efficient when a given SLP is small. Further, our result implies that any
string matching problem which is reducible to convolution can be efficiently solved
on SLP-compressed text.

1.1 Related work 

In [9], an algorithm which computes the convolution between a text and a pattern,
using Lempel-Ziv 78 factorization [23], was proposed. Given a text of length N and a
pattern of length m, the algorithm in [9] computes the convolution in O(N + mL) time
and space, where L is the number of LZ78 factors of the text. The authors claimed that
L = O(hN / log N), where 0 ≤ h ≤ 1 is the entropy of the text. However, this holds
only on some strings over a constant alphabet, and even on a constant alphabet there
exist strings with L = Θ(N / log N). Moreover, when the text is drawn from integer
alphabet Σ = [1, N], then clearly L = Ω(N). In this case, the algorithm of [9] takes at
least O(mN) time (excluding the time cost to compute the LZ78 factorization). Since
the LZ78 encoding of a text can be seen as an SLP, and since the running time of our
algorithm is independent of the alphabet size, this paper presents a more efficient
algorithm to compute the convolution on LZ78-compressed text over an integer alphabet.
Furthermore, our algorithm is much more general and can be applied to arbitrary SLPs.

2 Preliminaries



2.1 Strings 

Let Σ be a finite alphabet. An element of Σ* is called a string. The length of a string
S is denoted by |S|. The empty string ε is a string of length 0, namely, |ε| = 0.
For a string S = XYZ, X, Y and Z are called a prefix, substring, and suffix of S,
respectively. The i-th character of a string S is denoted by S[i], where 1 ≤ i ≤ |S|.
For a string S and two integers 1 ≤ i ≤ j ≤ |S|, let S[i : j] denote the substring of S
that begins at position i and ends at position j.

Our model of computation is the word RAM: We shall assume that the computer
word size is at least log_2 |S|, and hence, values representing lengths and positions
of string S can be manipulated in constant time. Space complexities will be determined
by the number of computer words (not bits).

2.2 Convolution 

Let V_S and V_P be two vectors on some field whose lengths are N and m, respectively,
with m ≤ N. The convolution C between V_S and V_P is defined by

$$C[i] = \sum_{j=1}^{m} V_P[j] \cdot V_S[i + j - 1]$$

for 1 ≤ i ≤ N − m + 1. It is well-known that the vector C can be computed in
O(N log m) time by FFT. The algorithm samples V_S at every (km + 1)-th position of
V_S for 0 ≤ k ≤ ⌊N/m⌋. For each sampled position the algorithm is able to compute
the convolution between the subvector V_S[km + 1 : (k + 2)m] of length 2m and V_P
in O(m log m) time, and therefore the whole vector C can be computed in a total of
O(N log m) time.
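
The blockwise scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, indices are 0-based, the inputs are assumed to be integer vectors, and a small textbook radix-2 FFT stands in for an optimized library routine. Each block of length at most 2m is convolved separately, and only the first m alignments of each block are kept so that no alignment is reported twice.

```python
import cmath

def convolve_naive(vs, vp):
    """C[i] = sum_j vp[j] * vs[i + j] for every alignment i (0-based)."""
    m = len(vp)
    return [sum(vp[j] * vs[i + j] for j in range(m))
            for i in range(len(vs) - m + 1)]

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + w
        out[k + n // 2] = even[k] - w
    return out

def convolve_fft(vs, vp):
    """Same values as convolve_naive, via a polynomial product (integer inputs)."""
    n, m = len(vs), len(vp)
    size = 1
    while size < n + m:
        size *= 2
    fa = fft([complex(x) for x in vs] + [0j] * (size - n))
    fb = fft([complex(x) for x in reversed(vp)] + [0j] * (size - m))
    prod = fft([x * y for x, y in zip(fa, fb)], invert=True)
    # Coefficient m - 1 + i of the product corresponds to alignment i.
    return [round(prod[m - 1 + i].real / size) for i in range(n - m + 1)]

def convolve_blockwise(vs, vp):
    """Convolve blocks of length at most 2m that start at every m-th position."""
    n, m = len(vs), len(vp)
    out = []
    for start in range(0, n - m + 1, m):
        block = vs[start:start + 2 * m]
        out.extend(convolve_fft(block, vp)[:m])  # alignments start .. start+m-1
    return out
```

For example, `convolve_blockwise([1, 2, 3, 4, 5, 6, 7], [1, 0, 2])` can be compared against `convolve_naive` on the same input to check that the block boundaries are handled consistently.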

We can solve several types of approximate matching problems for a text S of length
N and a pattern P of length m, by suitably mapping characters P[j] and S[i + j − 1]
to numerical values. For example, let φ_a(x) = 1 if x = a and 0 otherwise, for
any a ∈ Σ, then Σ_{a∈Σ} Σ_{j=1}^{m} φ_a(P[j]) · φ_a(S[i + j − 1]) represents the
number of matching positions when the pattern is aligned at position i of the text.
Consequently, the Hamming distances of the pattern and the text substrings for all
positions 1 ≤ i ≤ N − m + 1 can be computed in a total of O(|Σ| N log m) time, by
computing convolution using mappings φ_a for all a ∈ Σ and summing them up, which is
a classic result in [8].
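
The indicator-mapping reduction above can be illustrated with a short sketch. The function names are hypothetical, and the inner sum is evaluated naively to keep the example self-contained; in the setting of the text it would be replaced by an FFT-based convolution. Only characters that occur in the pattern are iterated over, since every other character contributes zero to the sum.

```python
def match_counts(text, pattern):
    """Number of matching positions for every alignment (0-based)."""
    n, m = len(text), len(pattern)
    counts = [0] * (n - m + 1)
    for a in set(pattern):
        ts = [1 if c == a else 0 for c in text]      # phi_a applied to the text
        ps = [1 if c == a else 0 for c in pattern]   # phi_a applied to the pattern
        for i in range(n - m + 1):
            counts[i] += sum(ps[j] * ts[i + j] for j in range(m))
    return counts

def hamming_distances(text, pattern):
    """Hamming distance = pattern length minus number of matching positions."""
    m = len(pattern)
    return [m - c for c in match_counts(text, pattern)]
```

A call such as `hamming_distances("aababaababaab", "aba")` returns one distance per alignment; alignments with distance 0 are exact occurrences.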

For convenience, in what follows we assume that strings S and P are over an integer
alphabet, and consider the convolution between S and P.

2.3 Straight Line Programs 

A straight line program (SLP) is a set of assignments 𝒮 = {X_1 → expr_1, X_2 →
expr_2, ..., X_n → expr_n}, where each X_i is a variable and each expr_i is an
expression, where expr_i = a (a ∈ Σ), or expr_i = X_ℓ(i) X_r(i) (i > ℓ(i), r(i)). It is
essentially a context free grammar in the Chomsky normal form, that derives a single
string. Let val(X_i) represent the string derived from variable X_i. To ease notation,
we sometimes associate val(X_i) with X_i and denote |val(X_i)| as |X_i|, and
val(X_i)([u : v]) as X_i([u : v]) for any interval [u : v]. An SLP 𝒮 represents the
string S = val(X_n). The size of the program 𝒮 is the number n of assignments in 𝒮.
Note that |S| can be as large as Θ(2^n). However, we assume as in various previous
work on SLP, that the computer word size is at least log_2 |S|, and hence, values
representing lengths and positions of S in our algorithms can be manipulated in
constant time.

Figure 1: The derivation tree of SLP 𝒮 = {X_1 → a, X_2 → b, X_3 → X_1 X_2,
X_4 → X_1 X_3, X_5 → X_3 X_4, X_6 → X_4 X_5, X_7 → X_6 X_5}, representing string
S = val(X_7) = aababaababaab.
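
As a concrete illustration of the definition, the SLP of Figure 1 can be written down as a plain list. The encoding below is an assumption made for this sketch (variable X_i is stored at index i, index 0 is unused, and a rule is either a terminal character or a pair of smaller indices); the helper names are hypothetical.

```python
# The SLP of Figure 1: X_1 -> a, X_2 -> b, X_3 -> X_1 X_2, ..., X_7 -> X_6 X_5.
SLP = [None, "a", "b", (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]

def lengths(slp):
    """|X_i| for all i, computed bottom-up in time linear in the number of rules."""
    size = [0] * len(slp)
    for i in range(1, len(slp)):
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size

def val(slp, i):
    """Fully decompress X_i. Output can be exponentially long; illustration only."""
    rule = slp[i]
    if isinstance(rule, str):
        return rule
    return val(slp, rule[0]) + val(slp, rule[1])
```

Because every right-hand side refers only to smaller indices, a single left-to-right pass suffices to fill in all lengths.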

The derivation tree of SLP 𝒮 is a labeled ordered binary tree where each internal
node is labeled with a non-terminal variable in {X_1, ..., X_n}, and each leaf is labeled
with a terminal character in Σ. The root node has label X_n. Let V denote the set of
internal nodes in the derivation tree. For any internal node v ∈ V, let ⟨v⟩ denote the
index of its label X_⟨v⟩. Node v has a single child which is a leaf labeled with c when
(X_⟨v⟩ → c) ∈ 𝒮 for some c ∈ Σ, or v has a left-child and right-child respectively
denoted ℓ(v) and r(v), when (X_⟨v⟩ → X_⟨ℓ(v)⟩ X_⟨r(v)⟩) ∈ 𝒮. Each node v of the
tree derives val(X_⟨v⟩), a substring of S, whose corresponding interval itv(v), with
S(itv(v)) = val(X_⟨v⟩), can be defined recursively as follows. If v is the root node,
then itv(v) = [1 : |S|]. Otherwise, if (X_⟨v⟩ → X_⟨ℓ(v)⟩ X_⟨r(v)⟩) ∈ 𝒮, then
itv(ℓ(v)) = [b_v : b_v + |X_⟨ℓ(v)⟩| − 1] and itv(r(v)) = [b_v + |X_⟨ℓ(v)⟩| : e_v],
where [b_v : e_v] = itv(v).

For any interval [b : e] of S (1 ≤ b ≤ e ≤ |S|), let ξ_𝒮(b, e) denote the deepest node
v in the derivation tree, which derives an interval containing [b : e], that is,
itv(v) ⊇ [b : e], and no proper descendant of v satisfies this condition. We say that
node v stabs interval [b : e], and X_⟨v⟩ is called the variable that stabs the interval.
If b = e, we have that (X_⟨v⟩ → c) ∈ 𝒮 for some c ∈ Σ, and itv(v) = b = e. If b < e,
then we have (X_⟨v⟩ → X_⟨ℓ(v)⟩ X_⟨r(v)⟩) ∈ 𝒮, b ∈ itv(ℓ(v)), and e ∈ itv(r(v)).

Theorem 1 ([4]). Given an SLP 𝒮 = {X_i → expr_i}_{i=1}^{n}, it is possible to pre-process
𝒮 in O(n) time and space, so that for any interval [b : e] of S, 1 ≤ b ≤ e ≤ N, its
stabbing variable X_⟨ξ_𝒮(b,e)⟩ can be computed in O(log N) time.
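
To make the notion of a stabbing variable concrete, the following sketch finds it by walking down the derivation tree from the root. This is not the O(log N)-time structure of Theorem 1: the walk takes time proportional to the height of the derivation tree. The list encoding of the SLP and the function names are assumptions made for illustration, and positions are 1-based as in the text.

```python
# The SLP of Figure 1 (index i holds the rule of X_i; index 0 is unused).
SLP = [None, "a", "b", (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]

def lengths(slp):
    size = [0] * len(slp)
    for i in range(1, len(slp)):
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    return size

def stab(slp, size, b, e):
    """Return (index of the stabbing variable, first position of its interval)."""
    i, start = len(slp) - 1, 1          # begin at the root X_n, interval [1 : N]
    while not isinstance(slp[i], str):
        left, right = slp[i]
        mid = start + size[left]        # first position of the right child
        if e < mid:                     # interval lies inside the left child
            i = left
        elif b >= mid:                  # interval lies inside the right child
            i, start = right, mid
        else:                           # b is in the left child, e in the right
            break
    return i, start
```

For the string of Figure 1, `stab(SLP, lengths(SLP), 8, 9)` stops at the root, since position 8 is the last position derived by X_6 and position 9 is the first derived by X_5.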

SLPs can be efficiently pre-processed to hold various information. |X_i| can be
computed for all variables X_i (1 ≤ i ≤ n) in a total of O(n) time by a simple dynamic
programming algorithm. Also, the following lemma is useful for partial decompression
of a prefix of a variable.

Lemma 1 ([11]). Given an SLP 𝒮 = {X_i → expr_i}_{i=1}^{n}, it is possible to pre-process
𝒮 in O(n) time and space, so that for any variable X_i and 1 ≤ q ≤ |X_i|, the prefix of
val(X_i) of length q, i.e. val(X_i)[1 : q], can be computed in O(q) time.
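
Partial decompression of a prefix can be sketched with a simple left-to-right traversal that stops after q characters. This sketch does not reproduce the preprocessing behind Lemma 1: without it, the traversal runs in O(q + h) time, where h is the height of the derivation tree, rather than O(q). The SLP encoding and the function name are assumptions made for illustration.

```python
# The SLP of Figure 1 (index i holds the rule of X_i; index 0 is unused).
SLP = [None, "a", "b", (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]

def prefix(slp, i, q):
    """First q characters of val(X_i), without decompressing the rest."""
    out, stack = [], [i]
    while stack and len(out) < q:
        rule = slp[stack.pop()]
        if isinstance(rule, str):
            out.append(rule)
        else:
            stack.append(rule[1])   # right child is expanded after the left one
            stack.append(rule[0])
    return "".join(out)
```

A suffix of bounded length can be obtained symmetrically by pushing the children in the opposite order and reversing the collected characters.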

2.4 Problem 

In this paper we tackle the following problem. 

Problem 1. Given an SLP 𝒮 = {X_i → expr_i}_{i=1}^{n} describing a text S and an
uncompressed pattern P of length m, compute a compact representation of the convolution
C between S and P.

By "compact representation" above, we mean a representation of convolution C
whose size is dependent (and polynomial) on n and m, and not on N = |S|. In the
following sections, we will present our algorithms to solve this problem. We will also
show that given a position i of the uncompressed text S with 1 ≤ i ≤ N − m + 1, our
representation is able to return C[i] quickly.

3 Basic algorithm 

In this section, we describe our compact representation of the convolution C for a string
S represented as an SLP 𝒮 of size n and a pattern P of length m. Our representation
is based on the fact that the value of the convolution depends only on the substrings
of length m of S. We use compact representations of all substrings of length m of S,
which were proposed in [12][13].

For any variable X_j = X_ℓ X_r, let t_j = suf(X_ℓ, m − 1) pre(X_r, m − 1). Namely, t_j
is the substring of val(X_j) obtained by concatenating the suffix of val(X_ℓ) of length
at most m − 1, and the prefix of val(X_r) of length at most m − 1 (see also Figure 2).
By the arguments of Section 2.3, for every text position 1 ≤ i ≤ N − m + 1 there
exists a unique variable X_j that stabs the interval [i : i + m − 1]. Hence, computing
C reduces to computing the convolution between t_j and pattern P for all variables X_j.
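
The reduction above can be sketched end to end on a small example. This is an illustrative sketch under several assumptions that are not from the paper: terminals are integers, m ≥ 2, the function names are hypothetical, the per-variable convolution is computed naively instead of by FFT, and the stabbing variable is found by a plain top-down walk rather than by the structure of Theorem 1.

```python
def conv(t, p):
    """Naive convolution of t with p; empty when t is shorter than p."""
    m = len(p)
    return [sum(p[k] * t[i + k] for k in range(m)) for i in range(len(t) - m + 1)]

def build(slp, p):
    """Per-variable tables: convolution of t_j with p, and the offset of t_j in val(X_j)."""
    m, n = len(p), len(slp)
    size, shift = [0] * n, [0] * n
    pre = [[] for _ in range(n)]    # prefix of val(X_j) of length at most m - 1
    suf = [[] for _ in range(n)]    # suffix of val(X_j) of length at most m - 1
    table = [[] for _ in range(n)]
    for j in range(1, n):
        rule = slp[j]
        if isinstance(rule, int):
            size[j], pre[j], suf[j] = 1, [rule], [rule]
            continue
        l, r = rule
        size[j] = size[l] + size[r]
        pre[j] = (pre[l] + pre[r])[:m - 1]
        suf[j] = (suf[l] + suf[r])[-(m - 1):]
        shift[j] = size[l] - len(suf[l])          # where t_j starts inside val(X_j)
        table[j] = conv(suf[l] + pre[r], p)       # convolution between t_j and P
    return size, shift, table

def query(slp, size, shift, table, m, i):
    """C[i] for a 1-based text position i, via the variable stabbing [i : i+m-1]."""
    j, start = len(slp) - 1, 1
    b, e = i, i + m - 1
    while True:
        l, r = slp[j]
        mid = start + size[l]
        if e < mid:
            j = l
        elif b >= mid:
            j, start = r, mid
        else:
            return table[j][(i - start) - shift[j]]

# Figure 1's SLP with a = 1 and b = 2.
SLP = [None, 1, 2, (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]
```

With `size, shift, table = build(SLP, [1, 2, 3])`, a call `query(SLP, size, shift, table, 3, i)` looks up C[i] without ever materializing the full text; the tables hold O(m) values per variable.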

Theorem 2. Given an SLP 𝒮 of size n representing a string S of length N, and pattern
P of length m, we can compute an O(nm)-size representation of convolution C for
S and P in O(nm log m) time. Given a text position 1 ≤ i ≤ N − m + 1, our
representation returns C[i] in O(log N) time.

Proof. Let t_j = suf(X_ℓ, m − 1) pre(X_r, m − 1) for any variable X_j = X_ℓ X_r. Since
|t_j| ≤ 2m − 2, we can compute each t_j in O(m) time by Lemma 1. We then compute
the convolution between t_j and P in O(m log m) time using the FFT algorithm. Since
there are n variables, it takes a total of O(nm log m) time and the total size of our
representation is O(nm).

By Theorem 1 we can compute the stabbing variable in O(log N) time. It is also
possible to compute in O(log N) time the text position corresponding to the node of the
derivation tree of 𝒮 representing the stabbing variable [3]. Thus C[i] can be answered
in O(log N) time. □

Figure 2: Substring t_j of val(X_j).

By a similar argument to Section 7 in [9], we obtain the following:

Theorem 3. Given a compact representation of the convolution between a string S and
a pattern P described above, we can output the set occ of all approximate occurrences
of P in S in O(|occ|) time.

4 Improved algorithm

The algorithm of the previous section is efficient when the given SLP is small, i.e.,
nm = o(N). However, n can be as large as O(N), and hence it can be slower than the
existing FFT-based O(N log m)-time algorithm.
To overcome this, we use the following result:

Lemma 2 ([13]). For any SLP 𝒮 of size n describing a text S of length N, there exists
a trie T of size O(min{nm, N − α}) with α ≥ 0, such that for any substring Q of
length m of S, there exists a directed path in T that spells out Q. The trie T can be
computed in linear time in its size.

Here α is a value that represents the amount of redundancy that the SLP captures
with respect to the length-m substrings, which is defined by

$$\alpha = \sum \{ (vOcc(X_j) - 1) \cdot (|t_j| - (m - 1)) \mid |X_j| \ge m,\ j = 1, \ldots, n \},$$

where vOcc(X_j) denotes the number of times a variable X_j occurs in the derivation
tree, i.e., vOcc(X_j) = |{v | X_⟨v⟩ = X_j}|.

By the above lemma, computing the convolution between an SLP-compressed string
and a pattern reduces to computing the convolution between a trie and a pattern. In the
following subsection, we will present our efficient algorithm to compute the
convolution between a trie and a pattern.
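
The redundancy value α from the definition above can be evaluated directly on an SLP. The sketch below assumes a list encoding of the SLP (index i holds the rule of X_i), m ≥ 2, and that every variable is reachable from X_n; the function name is hypothetical. It uses |t_j| = min(|X_ℓ|, m − 1) + min(|X_r|, m − 1), which follows from the definition of t_j.

```python
# The SLP of Figure 1 (index i holds the rule of X_i; index 0 is unused).
SLP = [None, "a", "b", (1, 2), (1, 3), (3, 4), (4, 5), (6, 5)]

def alpha(slp, m):
    n = len(slp) - 1
    size = [0] * (n + 1)
    for i in range(1, n + 1):
        rule = slp[i]
        size[i] = 1 if isinstance(rule, str) else size[rule[0]] + size[rule[1]]
    vocc = [0] * (n + 1)
    vocc[n] = 1                         # the root occurs exactly once
    total = 0
    for i in range(n, 0, -1):           # parents are processed before children
        rule = slp[i]
        if isinstance(rule, str):
            continue
        left, right = rule
        vocc[left] += vocc[i]
        vocc[right] += vocc[i]
        if size[i] >= m:
            t_len = min(size[left], m - 1) + min(size[right], m - 1)
            total += (vocc[i] - 1) * (t_len - (m - 1))
    return total
```

Each term counts the length-m windows of t_j once for every repeated occurrence of X_j in the derivation tree, which is the work a trie of all length-m substrings avoids repeating.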
Figure 3: Instance of input trie T.

Figure 4: The convolution between T and pattern 5 2 4 1 3. The value in each node
is the value of the convolution for the node and the pattern. The nodes with depth less
than |P| = 5 are left blank.

4.1 Convolution between trie and pattern 

Here we consider the convolution between a trie T and a pattern P ∈ Σ^+. For any
node v of T and a positive integer k, let str_T(v, k) be the suffix of the path from the
root of T to v of length min{k, depth(v)}. The subproblem to solve is formalized as
follows:

Problem 2. Given a trie T and a pattern P of length m, for all nodes v of T whose
depth is at least m, compute C_T(v) = Σ_{j=1}^{m} str_T(v, m)[j] P[j].

Figure 3 illustrates an instance of Problem 2. Figure 4 shows the values of the
convolution between the trie of Figure 3 and pattern 5 2 4 1 3.
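
As a baseline for Problem 2, the values C_T(v) can be computed directly from the definition in O(rm) time. The trie encoding below (a parent array and a label array, node 0 as the root, parents numbered before their children) and the function name are assumptions made for this sketch; the small trie is a made-up example, not the one of Figure 3.

```python
def trie_convolution_naive(parent, label, p):
    """C_T(v) for every node v of depth at least m, straight from the definition."""
    m = len(p)
    depth = [0] * len(parent)
    result = {}
    for v in range(1, len(parent)):
        depth[v] = depth[parent[v]] + 1
        if depth[v] < m:
            continue
        total, u = 0, v
        for j in range(m - 1, -1, -1):   # walk upward: the last label pairs with P[m]
            total += label[u] * p[j]
            u = parent[u]
        result[v] = total
    return result

# Root 0 with children 1 and 5; node 1 has child 2; node 2 has children 3 and 4.
parent = [None, 0, 1, 2, 2, 1]
label = [None, 1, 2, 3, 4, 5]            # label of the edge entering each node
```

The long path decomposition in the proof of Theorem 4 exists precisely to avoid this per-node walk of length m.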

Theorem 4. Problem 2 can be solved in O(r log m) time, where r is the size of T and
m = |P|.

Proof. Assume that the height of T is at least m since otherwise no computation is
needed. We show how to compute C_T(v) in O(log m) amortized time for each node
v ∈ T. We consider the long path decomposition such that T is decomposed into its
longest path and a forest consisting of the nodes that are not contained in the longest
path. We recursively apply the above decomposition to all trees in the forest, until each
subtree consists only of a single path. Figure 5 shows the long path decomposition
of the trie shown in Figure 3. It is easy to see that we can compute the long path
decomposition in O(r) time.

Figure 5: The long path decomposition of the trie shown in Figure 3.

For each path, we compute the convolution by FFT. Let (w_1, w_2, ..., w_d) be one of
the long paths, where d is the number of nodes on the path.

• When d ≥ m: It is enough to compute the convolution between str_T(w_d, d + m − 1)
and P, which takes O((d + m − 1) log m) time, i.e., O(((d + m − 1)/d) log m) =
O(log m) time per node.

• When d < m: The same method costs too much, i.e., O(((d + m − 1)/d) log m) time
per node, and thus we need a trick. The assumption that the height of T is at
least m implies that w_1 is not the root of T since otherwise, the longest path
in T would be (w_1, w_2, ..., w_d), and d − 1 (< m) would be the height of T, a
contradiction. Consequently, from the definition of the long path decomposition,
there must exist a path (not necessarily a long path) (z_1, z_2, ..., z_d) such that
w_1 ≠ z_1 and parent(w_1) = parent(z_1). For any 1 ≤ i ≤ d with depth(w_i) ≥ m,
C_T(w_i) can be written as follows:

$$
\begin{aligned}
C_T(w_i) &= \sum_{j=1}^{m-d} str_T(w_i, m)[j] P[j] + \sum_{j=m-d+1}^{m} str_T(w_i, m)[j] P[j] \\
&= \sum_{j=1}^{m-d} str_T(z_i, m)[j] P[j] + \sum_{j=m-d+1}^{m} str_T(w_i, m)[j] P[j] \\
&= C_T(z_i) - \sum_{j=m-d+1}^{m} str_T(z_i, m)[j] P[j] + \sum_{j=m-d+1}^{m} str_T(w_i, m)[j] P[j] \\
&= C_T(z_i) - C'_T(z_i) + C'_T(w_i),
\end{aligned}
$$

where C'_T(v) = Σ_{j=m−d+1}^{m} str_T(v, m)[j] P[j]. For all 1 ≤ i ≤ d, C'_T(w_i)
(resp. C'_T(z_i)) can be computed in O((d + d − 1) log d) time by convolution
between str_T(w_d, d + d − 1) (resp. str_T(z_d, d + d − 1)) and P[m − d + 1 : m].
Therefore, assuming that C_T(z_i) is already computed for all 1 ≤ i ≤ d, we can
compute C_T(w_i) for all 1 ≤ i ≤ d in O(((d + d − 1)/d) log d) = O(log m) time per
node.
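
The identity used in the short-path case can be checked numerically on a small instance. In this sketch the labels from the root down to the common parent are given as one list and the two sibling paths as two lists of equal length d; these names and the example values are assumptions for illustration. The partial sums C'_T are evaluated by plain dot products here, whereas the proof obtains them for all i at once by a convolution with P[m − d + 1 : m].

```python
def dot(xs, ys):
    return sum(x * y for x, y in zip(xs, ys))

def short_path_values(shared, z, w, p):
    """C_T(w_i) for i = 1..d via C_T(z_i) - C'_T(z_i) + C'_T(w_i); None if depth < m."""
    m, d = len(p), len(w)
    tail = p[m - d:]                     # P[m - d + 1 : m] in the 1-based notation
    out = []
    for i in range(1, d + 1):
        zs, ws = shared + z[:i], shared + w[:i]
        if len(ws) < m:
            out.append(None)
            continue
        c_z = dot(zs[-m:], p)            # C_T(z_i), assumed to be known already
        c1_z = dot(zs[-d:], tail)        # C'_T(z_i)
        c1_w = dot(ws[-d:], tail)        # C'_T(w_i)
        out.append(c_z - c1_z + c1_w)
    return out
```

The first m − d characters of str_T(w_i, m) and str_T(z_i, m) lie above the common parent, so they coincide and cancel in the subtraction; only the last d positions need fresh work.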

It follows from the above discussion that we can solve Problem 2 in O(r log m) time
by computing values of convolution by the longest path first and making use of the
result when encountering a short path whose length is less than m. □
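
The long path decomposition used in the proof can be sketched as follows. The trie is assumed to be given as a list of child lists with node 0 as the root, and the function name is hypothetical. From the head of each path, the walk always continues into a child of greatest height, and every other child becomes the head of a new path.

```python
def long_path_decomposition(children, root=0):
    n = len(children)
    order, stack = [], [root]
    while stack:                       # preorder: parents appear before children
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    height = [0] * n
    for v in reversed(order):          # children are finished before their parent
        for c in children[v]:
            height[v] = max(height[v], height[c] + 1)
    paths, heads = [], [root]
    while heads:
        v = heads.pop()
        path = [v]
        while children[v]:
            best = max(children[v], key=lambda c: height[c])
            heads.extend(c for c in children[v] if c != best)
            v = best
            path.append(v)
        paths.append(path)
    return paths

# Root 0 with children 1 and 5; node 1 has child 2; node 2 has children 3 and 4.
children = [[1, 5], [2], [3, 4], [], [], []]
```

Every node is placed on exactly one path and every child list is scanned a constant number of times, so the decomposition takes time linear in the size of the trie.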

We obtain the main result of this paper:

Theorem 5. Given an SLP 𝒮 of size n representing a string S of length N, and pattern
P of length m, we can compute an O(min{nm, N − α})-size representation of
convolution C for S and P in O(min{nm, N − α} log m) time, where α ≥ 0. Given a text
position 1 ≤ i ≤ N − m + 1, our representation returns C[i] in O(log N) time.

We note that a similar result to Theorem 3 holds for our O(min{nm, N − α})-size
representation of convolution, and hence we can compute all approximate occurrences
in time linear in its size.

5 Conclusions and future work 

In this paper we showed how, given an SLP-compressed text of size n and an
uncompressed pattern of length m, we can compute the convolution between the text and
the pattern efficiently. We employed an O(min{nm, N − α})-size trie representation
of all substrings of length m in the text, which never exceeds the uncompressed
size N of the text. By introducing a new technique to compute the convolution
between a trie of size r and a pattern of length m in O(r log m) time, we achieve an
O(min{nm, N − α} log m)-time solution to the problem. A consequence of this result
is that, for any string matching problem reducible to convolution, there exists a
CSP algorithm that does not require decompression of the entire compressed string.

However, it is not yet obvious whether we can straightforwardly adapt an algorithm
which also uses techniques other than convolution, such as the one in [2]. Future work
of interest is to clarify the above matter, and to implement our algorithms and conduct
experiments on highly compressible texts.